Abstract: In this work, we investigate the potential utility of
parallelization for meeting real-time constraints and minimizing 
energy. We consider malleable Gang scheduling of implicit- 
deadline sporadic tasks upon multiprocessors. We first show the 
non-necessity of dynamic voltage/frequency regarding optimality 
of our scheduling problem. We adapt the canonical schedule for 
DVFS multiprocessor platforms and propose a polynomial-time 
optimal processor/frequency-selection algorithm. We evaluate the 
performance of our algorithm via simulations using parameters 
obtained from a hardware testbed implementation. Our algo- 
rithm has up to a 60 watt decrease in power consumption over 
the optimal non-parallel approach. 

I. Introduction 

Power-aware computing is at the forefront of embedded 
systems research due to market demands for increased battery 
life in portable devices and decreasing the carbon footprint 
of embedded systems in general. The drive to reduce system 
power consumption has led embedded system designers to 
increasingly utilize multicore processing architectures. An oft- 
repeated benefit of multicore platforms over computationally- 
equivalent single-core platforms is increased energy efficiency 
and thermal dissipation [1]. For these power benefits to be
fully realized, a computer system must possess the ability 
to parallelize its computational workload across the multiple 
processing cores. However, parallel computation often comes 
at a cost of increasing the total, overall computation that the 
system must perform due to communication and synchroniza- 
tion overhead of the cooperating parallel processes. In this 
paper, we explore the trade-off between parallelization of real- 
time applications and energy consumption. 

II. Related Work 

There are two main models of parallel tasks (i.e., tasks that 
may use several processors simultaneously): the Gang [2|, 0, 
|U, and the Thread model @, Q, 0. With the Gang 
model, all parallel instances of a same task start and stop 
using the processors synchronously. On the other hand, with 
the Thread model, there is no such constraint. Hence, once a 
thread has been released, it can be executed on the processing 
platform independently of the execution of the other threads. 

Very little research has addressed both real-time paralleliza- 
tion and power-consumption issues 0, flDI ■ Furthermore, 
some basic fundamental questions on the potential utility of 
parallelization for meeting real-time constraints and minimiz- 
ing energy have not been addressed at all in the literature. 



III. Models 

A. Parallel Job Model 

In real-time systems, a job $J_\ell$ is characterized by its arrival
time $A_\ell$, execution time $E_\ell$, and relative deadline $D_\ell$. The
interpretation of these parameters is that the system must
schedule $E_\ell$ units of execution on the processing platform
in the interval $[A_\ell, A_\ell + D_\ell)$. Traditionally, most real-time
systems research has assumed that the execution of $J_\ell$ must
occur sequentially (i.e., $J_\ell$ may not execute concurrently with
itself on two or more different processors). However,
in this paper, we deal with jobs which may be executed on
different processors at the very same instant, in which case we
say that job parallelism is allowed. Various kinds of task models
exist; Goossens et al. [4] adapted parallel terminology [11] to
recurrent (real-time) tasks as follows.

Definition 1 (Rigid, Moldable and Malleable Job). A job is
said to be (i) rigid if the number of processors assigned to
this job is specified externally to the scheduler a priori, and
does not change throughout its execution; (ii) moldable if the
number of processors assigned to this job is determined by the
scheduler, and does not change throughout its execution; (iii)
malleable if the number of processors assigned to this job can
be changed by the scheduler during the job's execution.

As a starting point for investigating the tradeoff between 
energy consumption and parallelism in real-time systems, 
we will work with the malleable job model in this paper. 
Schedulability analysis is more complicated for the rigid and 
moldable job models, and we defer study of these models for 
future research. 

B. Parallel Task Model 

In real-time systems, jobs are generated by (recurring) tasks.
One general and popular real-time task model is the sporadic
task model [12], where each sporadic task $\tau_i$ is characterized by
its worst-case execution time $e_i$, task relative deadline $d_i$, and
minimum inter-arrival time $p_i$ (also called the task's period).
A task $\tau_i$ can generate a (potentially) infinite sequence of jobs
$J_1, J_2, \ldots$ such that: 1) $J_1$ may arrive at any time after system
start time; 2) successive jobs must be separated by at least $p_i$
time units (i.e., $A_{\ell+1} \ge A_\ell + p_i$); 3) each job has an execution
requirement no larger than the task's worst-case execution time
(i.e., $E_\ell \le e_i$); and 4) each job's relative deadline is equal to
the task relative deadline (i.e., $D_\ell = d_i$). A useful metric
of a task's computational requirement upon the system is
utilization, denoted by $u_i$ and computed as $e_i/p_i$. A collection
of sporadic tasks $\tau = \{\tau_1, \tau_2, \ldots, \tau_n\}$ is called a sporadic
task system. In this paper, we assume a common subclass of
sporadic task systems called implicit-deadline sporadic task
systems, where each $\tau_i \in \tau$ must have relative deadline equal
to its period (i.e., $d_i = p_i$).

At the task level, the literature distinguishes between at least 
two kinds of parallelism: 

• Multithread. Each task is a sequence of phases, each phase
is composed of several threads, and each thread requires a
single processor for execution and can be scheduled
simultaneously [13]. A particular case is the Fork-Join
task model, where a task begins as a single master thread
that executes sequentially until it encounters the first fork
construct, where it splits into multiple parallel threads
which execute the parallelizable part of the computation [7],
and so on.

• Gang. Each task corresponds to an $e \times k$ rectangle, where $e$
is the execution time requirement and $k$ the number of
required processors, with the restriction that the $k$ processors
execute task in unison 0. 

In this paper, we assume malleable Gang task scheduling. 

Due to the overhead of communication and synchroniza- 
tion required in parallel processing, there are fundamental 
limitations on the speedup obtainable by any real-time job. 
Assuming that a job $J_\ell$ generated by task $\tau_i$ is assigned
to $k_\ell$ processors for parallel execution over some $t$-length
interval, the speedup factor obtainable is $\gamma_{i,k_\ell}$. The
interpretation of this parameter is that over this $t$-length
interval $J_\ell$ will complete $\gamma_{i,k_\ell} \cdot t$ units of execution. We let
$\Gamma_i = (\gamma_{i,0}, \gamma_{i,1}, \ldots, \gamma_{i,m}, \gamma_{i,m+1})$ denote the multiprocessor
speedup vector for jobs of task $\tau_i$ (assuming $m$ identical
processing cores). The variables $\gamma_{i,0}$ and $\gamma_{i,m+1}$ are sentinel
values used to simplify the algorithm of Section V; the values
of $\gamma_{i,0}$ and $\gamma_{i,m+1}$ are $0$ and $\infty$, respectively. Throughout the
rest of the paper, we will characterize a parallel sporadic task
$\tau_i$ by $(e_i, p_i, \Gamma_i)$.

We consider the following two restrictions on the multipro- 
cessor speedup vector: 

• Sub-linear speedup ratio [9]: $1 < \frac{\gamma_{i,j'}}{\gamma_{i,j}} < \frac{j'}{j}$, where
$0 < j < j' \le m$.

• Work-limited parallelism [2]: $\gamma_{i,(j'+1)} - \gamma_{i,j'} \le \gamma_{i,(j+1)} - \gamma_{i,j}$,
where $j < j' < m$.

The sub-linear speedup ratio restriction represents the fact 
that no task can truly achieve an ideal or better than ideal 
speedup due to the overhead in parallelization. It also requires 
that the speedup factor strictly increases with the number of 
processors. The work-limited parallelism restriction ensures 
that the overhead only increases as more processors are used 
by the job. These restrictions place realistic bounds on the 
types of speedups observable by parallel applications. 
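
As an illustration of the two restrictions, the following Python sketch checks them for a speedup vector given without its sentinel entries. The function names and the use of exact fractions are choices made for this example only; they do not come from the paper.

```python
from fractions import Fraction


def is_sublinear(gamma):
    """Check 1 < gamma_{j'} / gamma_j < j'/j for all 0 < j < j' <= m.

    gamma[j - 1] holds the speedup on j processors (sentinels omitted).
    """
    g = [Fraction(x) for x in gamma]
    m = len(g)
    for j in range(1, m + 1):
        for jp in range(j + 1, m + 1):
            ratio = g[jp - 1] / g[j - 1]
            if not (1 < ratio < Fraction(jp, j)):
                return False
    return True


def is_work_limited(gamma):
    """Check that the gain of one extra processor never grows.

    Comparing consecutive increments is enough: if each increment is at
    most the previous one, the inequality holds for every pair j < j'.
    Assumption of this sketch: only indices j >= 1 are compared.
    """
    g = [Fraction(x) for x in gamma]
    steps = [b - a for a, b in zip(g, g[1:])]
    return all(later <= earlier for earlier, later in zip(steps, steps[1:]))
```

Under these definitions, an ideal linear speedup such as (1, 2, 3) is rejected by the first check, since the ratio must stay strictly below $j'/j$.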

C. Power/Processor Model 

We assume that the parallel sporadic task system r executes 
upon a multiprocessor platform with m identical processing 
cores. The processing platform is enabled with both dynamic
power management (DPM) and dynamic voltage and frequency
scaling (DVFS) capabilities. With respect to DPM
capabilities, we assume that the processing platform has the
ability to turn off any number of cores between $0$ and $m - 1$.
For DVFS capabilities, in this work, we assume that there is a
system-wide homogeneous frequency $f > 0$ which indicates
the frequency at which all cores are executing at any given
moment. The power function $P(f, k)$ indicates the power
dissipation rate of the processing platform when executing
with $k$ active cores at a frequency of $f$. We only assume
that $P(f, k)$ is a non-decreasing, convex function. While we
consider the setting where the system may dynamically change 
frequency without penalty, we consider that there is significant 
overhead to turning a core on or off. Therefore, in this 
paper, we will only consider core speed/activation assignment 
schemes where the number of active cores is decided prior to 
system runtime and does not change dynamically. 

The interpretation of the frequency is that if $\tau_i$ is executing
job $J_\ell$ on $k_\ell$ processors at frequency $f$ over a $t$-length interval,
then it will have executed $t \cdot \gamma_{i,k_\ell} \cdot f$ units of computation. The
total energy consumed by executing $k$ cores over the $t$-length
interval at frequency $f$ is $t \cdot P(f, k)$.

D. Scheduling Algorithm 

In this paper, we use a scheduling algorithm originally developed
for non-power-aware parallel real-time systems called
the canonical parallel schedule [2]. The canonical scheduling
approach is optimal for implicit-deadline sporadic real-time 
tasks with work-limited parallelism and sub-linear speedup 
ratio upon an identical multiprocessor platform (i.e., each 
processor has identical processing capabilities and speed). 
In this paper, we consider also an identical multiprocessor 
platform, but permit both the number of active processors 
and homogeneous frequency $f$ for all active processors to be
chosen prior to system runtime. In this subsection, we briefly 
define the canonical scheduling approach with respect to our 
power-aware setting. 

Assuming the processor frequencies are identical and set to a fixed
value $f$, it can be noticed that a task $\tau_i$ requires more than
$k$ processors simultaneously if $u_i > \gamma_{i,k} \cdot f$; for the unitary
frequency, we denote by $k_i$ the largest such $k$ (meaning that
$k_i$ is the smallest number of processors such that the task $\tau_i$
is schedulable on $k_i + 1$ processors at frequency $f = 1$):

$$k_i \stackrel{\text{def}}{=} \begin{cases} 0 & \text{if } u_i \le \gamma_{i,1} \\ \max_{k=1}^{m} \{k \mid \gamma_{i,k} < u_i\} & \text{otherwise.} \end{cases} \tag{1}$$

For example, let us consider the task system $\tau = \{\tau_1, \tau_2\}$
to be scheduled on three processors with $f = 1$. We have
$\tau_1 = (6, 4, \Gamma_1)$ with $\Gamma_1 = (1.0, 1.5, 2.0)$ and $\tau_2 = (3, 4, \Gamma_2)$
with $\Gamma_2 = (1.0, 1.2, 1.3)$. Notice that the system is infeasible
at this frequency if job parallelism is not allowed since $\tau_1$ will
never meet its deadline unless it is scheduled on at least two
processors (i.e., $k_1 = 1$). There is a feasible schedule if the
task $\tau_1$ is scheduled on two processors and $\tau_2$ on a third one
(i.e., $k_2 = 0$).
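
The two values of this example can be reproduced with a few lines of Python. This is a minimal sketch of the definition of $k_i$ at unit frequency; the helper name `k_unit` and the list encoding of the speedup vector (sentinels omitted) are assumptions of this illustration.

```python
from fractions import Fraction


def k_unit(e, p, gamma):
    """k_i at f = 1: the largest k with gamma_{i,k} < u_i, or 0 if none.

    gamma[k - 1] is the speedup on k processors (sentinels omitted).
    """
    u = Fraction(e, p)  # utilization u_i = e_i / p_i
    return max((k for k, g in enumerate(gamma, start=1) if Fraction(g) < u),
               default=0)


k1 = k_unit(6, 4, ["1.0", "1.5", "2.0"])  # tau_1, utilization 3/2
k2 = k_unit(3, 4, ["1.0", "1.2", "1.3"])  # tau_2, utilization 3/4
```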

The canonical schedule: This scheduler assigns $k_i$ processor(s)
permanently to $\tau_i$ and an additional processor sporadically
(see [2] for details). In this work we will extend that
technique for dynamic voltage and frequency scaling (DVFS)
and dynamic power management (DPM) capabilities.

IV. Non-Necessity of DVFS for Malleable Jobs 

Property 1. In a multiprocessor system with global homoge- 
neous frequency in a continuous range, choosing dynamically 
the frequency is not necessary for optimality in terms of 
consumed energy. 

Proof: [14] presented a similar result; here we prove the
property for our framework. Although we have a proof of this
property for any convex form of $P(v)$ ($v$ is the voltage chosen,
directly linked to the resulting frequency of the system),
for space limitations, in the following we will consider that
$P(v) \propto v^3$. Assuming we have a schedule at the constant
speed/voltage $v$ on the (multiprocessor) platform, we will
show that any dynamic frequency schedule (which schedules
the same amount of work) consumes no less energy. First
notice that from any dynamic frequency schedule we can
obtain a constant frequency schedule (which schedules the
same amount of work) by applying, sequentially, the following
transformation: given a dynamic frequency schedule in the
interval $[a, b]$ which works at voltage $v_1$ in $[a, \ell)$ and at voltage
$v_2$ in $[\ell, b]$, we can define the constant voltage such that at that
speed/voltage the amount of work is identical.

Without loss of generality we will consider the constant
voltage schedule on the interval $[0, 1]$ working at voltage $v$ and
the dynamic schedule working at voltage $v + \Delta$ in $[0, \ell)$ and
at the voltage $v - \Delta'$ in $[\ell, 1]$.

Since the transformation must preserve the amount of work
completed we must have:

$$v = \ell (v + \Delta) + (1 - \ell)(v - \Delta') \iff \Delta' = \frac{\ell \Delta}{1 - \ell} \tag{2}$$

since the extra work in $[0, \ell)$ (i.e., $\Delta \ell$) must be equal to the
spare work in $[\ell, 1]$ (i.e., $\Delta' (1 - \ell)$).

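Now we will compare the relative energy consumed by both
the schedules, i.e., we will show that

$$\ell (v + \Delta)^3 + (1 - \ell)\left(v - \frac{\ell \Delta}{1 - \ell}\right)^3 > v^3. \tag{3}$$

We know that $\ell (v + \Delta)^3 = \ell (v^3 + 3 v^2 \Delta + 3 v \Delta^2 + \Delta^3)$ and
(3) is equivalent to (by subtracting $v^3$ on both sides)

$$3 \ell v \Delta^2 + \ell \Delta^3 + 3 v \frac{\ell^2 \Delta^2}{1 - \ell} - \frac{\ell^3 \Delta^3}{(1 - \ell)^2} > 0.$$

Or equivalently (dividing by $\ell \Delta$):

$$3 \Delta v + \Delta^2 + 3 v \frac{\ell \Delta}{1 - \ell} - \frac{\ell^2 \Delta^2}{(1 - \ell)^2} > 0,$$

which is implied (since $v - \Delta' > 0$ and, by (2), $v > \frac{\ell \Delta}{1 - \ell}$) by

$$3 \Delta v + \Delta^2 + 3 \frac{\ell \Delta}{1 - \ell} \cdot \frac{\ell \Delta}{1 - \ell} - \frac{\ell^2 \Delta^2}{(1 - \ell)^2} > 0 \iff 3 \Delta v + \Delta^2 + 2 \frac{\ell^2 \Delta^2}{(1 - \ell)^2} > 0,$$

which always holds because $\Delta > 0$ and $v > 0$. ■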
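
The inequality can also be checked numerically. The sketch below assumes the cubic model $P(v) = v^3$ used in the proof and evaluates both schedules over the unit interval with exact rational arithmetic; the function names and the sampled values of $v$, $\Delta$ and $\ell$ are arbitrary choices for illustration.

```python
from fractions import Fraction


def two_level_energy(v, delta, ell):
    """Energy over [0, 1] when running at v + delta on [0, ell) and at
    v - delta' on [ell, 1], with P(v) = v**3 and delta' = ell*delta/(1-ell)."""
    delta_p = ell * delta / (1 - ell)
    work = ell * (v + delta) + (1 - ell) * (v - delta_p)
    assert work == v  # same amount of work as the constant schedule
    return ell * (v + delta) ** 3 + (1 - ell) * (v - delta_p) ** 3


def constant_energy(v):
    """Energy over [0, 1] at constant voltage v with P(v) = v**3."""
    return v ** 3


v = Fraction(1)
gaps = []
for ell in (Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)):
    for delta in (Fraction(1, 10), Fraction(1, 5), Fraction(3, 10)):
        if v - ell * delta / (1 - ell) > 0:  # keep the lower voltage positive
            gaps.append(two_level_energy(v, delta, ell) - constant_energy(v))
```

Every entry of `gaps` is the extra energy paid by the two-level schedule for the same amount of work; a sampled check of this kind illustrates the property but does not replace the proof.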

V. Optimal Processor/Frequency-Selection 
Algorithm 

Property 1 implies, for a homogeneous frequency upon the
different processing cores, that for each DVFS schedule, there
exists a constant-frequency schedule which consumes no
more energy. Thus, the frequency that minimizes consumed
energy can be computed prior to the execution of the system. So,
in the following, we will design an offline algorithm to find this
optimal minimal frequency. This parameter will allow us to use
the canonical schedule [2] to find a schedule of the system.
First, we will present the feasibility criterion adapted to variable
homogeneous frequency. After that we will use this criterion to
determine constraints on the frequency for the system to be
feasible on a fixed number of processors. We will then
present an algorithm which uses those constraints to compute
the exact optimal frequency for the system to be feasible.
Finally, we will prove the correctness of this algorithm.

In the following we denote by $f$ the frequency of our
multiprocessor platform. Notice that we made the hypothesis
that time is continuous (as in [2]). More specifically, we can
also choose the frequency in the positive continuous range
($f \in \mathbb{R}^+$).

A. Background 

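Notice that a task $\tau_i$ requires more than $k$ processors
simultaneously if $u_i > \gamma_{i,k} \times f$; we denote by $k_i(f)$ the
largest such $k$ (meaning that $k_i(f)$ is the smallest number of
processors such that the task $\tau_i$ is schedulable on $k_i(f) + 1$
processors):

$$k_i(f) \stackrel{\text{def}}{=} \begin{cases} 0 & \text{if } u_i \le \gamma_{i,1} \times f \\ \max_{k=1}^{m} \{k \mid \gamma_{i,k} \times f < u_i\} & \text{otherwise.} \end{cases} \tag{4}$$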

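This definition extends the one of $k_i$, Eq. (1). Notice that
we have $k_i = k_i(1)$. For a given number of processors
$\kappa \in \{0, \ldots, m - 1, m\}$, we wish to determine the range of
frequencies $[f_1, f_2)$ such that $k_i(f) = \kappa$ for all $f \in [f_1, f_2)$.
We denote as $k_i^{-1}(\kappa)$ the inverse function

$$k_i^{-1}(\kappa) \stackrel{\text{def}}{=} \begin{cases} \left\{ f \;\middle|\; \frac{u_i}{\gamma_{i,\kappa+1}} \le f < \frac{u_i}{\gamma_{i,\kappa}} \right\} & \text{if } 0 < \kappa \le m \\ \left[ \frac{u_i}{\gamma_{i,1}}, \infty \right) & \text{otherwise.} \end{cases} \tag{5}$$

We denote the left endpoint (resp., right endpoint) of $k_i^{-1}(\kappa)$
as $k_i^{-1}(\kappa).f_1$ (resp., $k_i^{-1}(\kappa).f_2$).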
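
A minimal Python sketch of $k_i(f)$ and of its inverse follows. It assumes the speedup vector is stored as a sorted list without sentinels, and it relies on the fact that, for $f > 0$, the condition $\gamma_{i,k} \times f < u_i$ is the same as $\gamma_{i,k} < u_i / f$, so a standard binary search applies. The names `k_of_f` and `k_inverse` are illustrative.

```python
from bisect import bisect_left
from fractions import Fraction
import math


def k_of_f(u, gamma, f):
    """k_i(f): number of entries gamma_{i,k} with gamma_{i,k} * f < u_i.

    gamma = [gamma_{i,1}, ..., gamma_{i,m}] is strictly increasing, so the
    count equals the largest such k and a binary search finds it.
    """
    return bisect_left(gamma, u / f)


def k_inverse(u, gamma, kappa):
    """Endpoints (f1, f2) of the interval [f1, f2) on which k_i(f) = kappa."""
    m = len(gamma)
    f1 = u / gamma[kappa] if kappa < m else Fraction(0)   # gamma_{i,m+1} = inf
    f2 = u / gamma[kappa - 1] if kappa > 0 else math.inf  # gamma_{i,0} = 0
    return f1, f2


u1 = Fraction(3, 2)                                  # utilization of tau_1
gamma1 = [Fraction(1), Fraction(3, 2), Fraction(2)]  # Gamma_1 from Section III-D
```

For instance, `k_inverse(u1, gamma1, 1)` gives the frequency range on which $\tau_1$ permanently occupies exactly one processor.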

B. Feasibility criteria with variable homogeneous frequency 

We will now present a necessary and sufficient condition for
the feasibility of a task system $\tau$ on $m$ identical processors at
frequency $f > 0$.

Theorem 1. A sporadic task system $\tau = \{\tau_1, \tau_2, \ldots, \tau_n\}$
is feasible on an identical platform with $m$ processing cores
at frequency $f > 0$ if and only if the task system
$\tau' \stackrel{\text{def}}{=} \{\tau'_1, \tau'_2, \ldots, \tau'_n\}$ is feasible on the same system with $m$
processing cores at frequency 1.
$\tau'$ is defined as follows:

$$\tau'_i \stackrel{\text{def}}{=} (e_i, p_i, \Gamma'_i), \qquad \Gamma'_i \stackrel{\text{def}}{=} (\gamma'_{i,0}, \ldots, \gamma'_{i,m}, \gamma'_{i,m+1}),$$

$$\forall\, 0 \le k \le m + 1:\quad \gamma'_{i,k} \stackrel{\text{def}}{=} \gamma_{i,k} \times f.$$

Proof: First of all, it is easy to see that $\tau$ respects the sub-linear
speedup ratio and work-limited parallelism restrictions if and only
if $\tau'$ respects them also.

We know that if $\tau_i$ is executing a job on $k$ processors at
frequency $f$ over a $t$-length interval then it will have executed
$t \cdot \gamma_{i,k} \cdot f$ units of computation. For the same interval, $\tau'_i$, at
frequency 1, is executing $t \cdot \gamma'_{i,k} \cdot 1 = t \cdot \gamma_{i,k} \cdot f$ units of
computation. The amount of work executed per unit of time
is exactly the same for every task of both systems. So if there
exists one schedule without any deadline miss for one of the
two systems, we can use the same one to schedule the other
system. Thus, we can conclude that $\tau$ is feasible if and only
if $\tau'$ is feasible. ■
Theorem 2. A necessary and sufficient condition for a sporadic
task system $\tau$ (respecting sub-linear speedup ratio and
work-limited parallelism) to be feasible on $m$ processors at
frequency $f$ is given by:

$$m \ge \sum_{i=1}^{n} \left( k_i(f) + \frac{u_i - \gamma_{i,k_i(f)} \times f}{(\gamma_{i,k_i(f)+1} - \gamma_{i,k_i(f)}) \times f} \right) \tag{6}$$

Proof: By Theorem 1 we know that $\tau$ is feasible at
frequency $f$ on $m$ processing cores if and only if $\tau'$ is
feasible at frequency 1. In [2], there is a necessary and
sufficient feasibility condition for any sporadic task system
(work-limited and sub-linear speedup ratio) for fixed frequency
($f = 1$). This result can be used to establish the schedulability
of $\tau'$.

So, using the result given by [2], we know that $\tau'$ is feasible
if and only if this inequation holds:

$$m \ge \sum_{i=1}^{n} \left( k'_i + \frac{u_i - \gamma'_{i,k'_i}}{\gamma'_{i,k'_i+1} - \gamma'_{i,k'_i}} \right) \tag{7}$$

where $k'_i$ denotes the value of $k_i$ (cf. definition given by (1))
calculated for the system $\tau'$ at frequency 1, so that
$\forall\, 1 \le i \le n$: $k'_i = k_i(f)$.

We can now replace $k'_i$ and $\gamma'_{i,k}$ by their value in (7):

$$m \ge \sum_{i=1}^{n} \left( k_i(f) + \frac{u_i - \gamma_{i,k_i(f)} \times f}{\gamma_{i,k_i(f)+1} \times f - \gamma_{i,k_i(f)} \times f} \right) \iff m \ge \sum_{i=1}^{n} \left( k_i(f) + \frac{u_i - \gamma_{i,k_i(f)} \times f}{(\gamma_{i,k_i(f)+1} - \gamma_{i,k_i(f)}) \times f} \right)$$

which corresponds exactly to (6). So $\tau'$ is feasible if and
only if (6) holds. Thus, by Theorem 1, $\tau$ is feasible if and
only if (6) holds. ■
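Definition 2 (Minimum number of processors function for
parallel tasks and system). For any $\tau_i \in \tau$:

$$M_i(f) \stackrel{\text{def}}{=} k_i(f) + \frac{u_i - \gamma_{i,k_i(f)} \times f}{(\gamma_{i,k_i(f)+1} - \gamma_{i,k_i(f)}) \times f}$$

Therefore, we can define the same notion system-wide:

$$M_\tau(f) \stackrel{\text{def}}{=} \sum_{i=1}^{n} M_i(f) = \sum_{i=1}^{n} \left( k_i(f) + \frac{u_i - \gamma_{i,k_i(f)} \times f}{(\gamma_{i,k_i(f)+1} - \gamma_{i,k_i(f)}) \times f} \right)$$

Based on this definition, the feasibility criterion (6) becomes:

$$m \ge M_\tau(f) \tag{8}$$

Notice that, for a fixed frequency $f$, the minimum number
of processors necessary and sufficient to schedule the system
is $\lceil M_\tau(f) \rceil$.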
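
To make the definition concrete, here is a small Python sketch that evaluates $M_i(f)$ and $M_\tau(f)$ for the two-task system of Section III-D. The encoding of a task as a pair (utilization, speedup list without sentinels) and the function names are assumptions of this example; exact fractions are used so that comparisons against $m$ are not affected by rounding.

```python
from bisect import bisect_left
from fractions import Fraction
from math import ceil


def M_i(u, gamma, f):
    """Per-task term of Definition 2; gamma = [gamma_{i,1}, ..., gamma_{i,m}]."""
    k = bisect_left(gamma, u / f)            # k_i(f)
    if k == len(gamma):                      # sentinel gamma_{i,m+1} = infinity:
        return Fraction(k)                   # the fractional term vanishes
    g_lo = gamma[k - 1] if k > 0 else Fraction(0)  # sentinel gamma_{i,0} = 0
    g_hi = gamma[k]                          # gamma_{i,k+1}
    return k + (u - g_lo * f) / ((g_hi - g_lo) * f)


def M_tau(tasks, f):
    """System-wide value: sum of M_i(f) over all (u_i, Gamma_i) pairs."""
    return sum(M_i(u, gamma, f) for u, gamma in tasks)


F = Fraction
tasks = [(F(3, 2), [F(1), F(3, 2), F(2)]),       # tau_1 = (6, 4, Gamma_1)
         (F(3, 4), [F(1), F(6, 5), F(13, 10)])]  # tau_2 = (3, 4, Gamma_2)
needed_at_unit_frequency = ceil(M_tau(tasks, F(1)))
```

At $f = 1$ the sum is $11/4$, so three processors are needed, which agrees with the schedule described for this example in Section III-D.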

In the following, we will show that in our model, the feasibility
of the system is sustainable regarding the frequency, i.e.,
increasing the value of the frequency maintains the feasibility
of the system. For this, we will need the following theorem.
Theorem 3. $M_\tau(f)$ is a monotonically decreasing function
for $f > 0$.

Proof: We will first prove something stronger:

$$\forall \tau_i \in \tau,\ \forall f_1, f_2 \in \mathbb{R}^+:\quad 0 < f_1 < f_2 \implies M_i(f_1) \ge M_i(f_2) \tag{9}$$

First, notice that $\forall \tau_i \in \tau$, $k_i(f)$ is a decreasing staircase
function. Indeed, the value of $k_i(f)$ depends on the satisfaction
of $\gamma_{i,k} \times f < u_i$. In this inequation, the greater $f$ is, the smaller
$\gamma_{i,k}$ has to be for it to hold, and so does $k$ because $\Gamma_i$ is ordered
by model assumption ($0 < \gamma_{i,1} < \gamma_{i,2} < \cdots < \gamma_{i,m}$).

In order to confirm the decrease of $M_i(f)$, we have to
consider both cases:

• When $k_i(f)$ remains constant between two frequencies.

• When $k_i(f)$ jumps a step between two frequencies.

The case when $\kappa = m$ is trivial: for $f < \frac{u_i}{\gamma_{i,m}}$, $M_i(f) = m$
and task $\tau_i$ is not schedulable on $m$ processors at this frequency.
So before $f = \frac{u_i}{\gamma_{i,m}}$, $M_i(f)$ is constant and therefore
monotonically decreases.

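First case to consider is when $k_i(f) = \kappa$ is fixed
($\kappa \in \{0, 1, \ldots, m - 1\}$). By (5), we have that for $f$ in $k_i^{-1}(\kappa)$,
$k_i(f) = \kappa$ remains constant and

$$M_i(f) = \kappa + \frac{u_i - \gamma_{i,\kappa} \times f}{(\gamma_{i,\kappa+1} - \gamma_{i,\kappa}) \times f} = \kappa + \frac{u_i}{(\gamma_{i,\kappa+1} - \gamma_{i,\kappa}) \times f} - \frac{\gamma_{i,\kappa}}{\gamma_{i,\kappa+1} - \gamma_{i,\kappa}}$$

decreases as a multiplicative inverse function (terms which
do not depend on $f$ are fixed in this interval).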

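Now consider the case when there is a variation in the value
of $k_i(f)$. This occurs only when $f = \frac{u_i}{\gamma_{i,k}}$ for $k = m, m - 1, \ldots, 1$.
At this exact value of the frequency, the value of
$k_i(f)$ jumps from $k$ to $k - 1$. We will prove that even in those
cases, the function $M_i(f)$ still decreases.
We have to prove the following:

$$\frac{u_i}{\gamma_{i,k+1}} \le f' < \frac{u_i}{\gamma_{i,k}} = f \implies M_i(f') > M_i(f)$$

Let us compute their values individually, $\forall\, 1 \le k \le m$:

$$k_i(f) = k_i\!\left(\frac{u_i}{\gamma_{i,k}}\right) = k - 1$$

$$M_i(f) = M_i\!\left(\frac{u_i}{\gamma_{i,k}}\right) = k - 1 + \frac{u_i - \gamma_{i,k-1} \times \frac{u_i}{\gamma_{i,k}}}{(\gamma_{i,k} - \gamma_{i,k-1}) \times \frac{u_i}{\gamma_{i,k}}} = k - 1 + 1 = k$$

$$k_i(f') = k \quad \text{because } f' \in k_i^{-1}(k)$$

$$M_i(f') = k + \frac{u_i - \gamma_{i,k} \times f'}{(\gamma_{i,k+1} - \gamma_{i,k}) \times f'}$$

We know $u_i - \gamma_{i,k} \times f' > 0$ because $f' < \frac{u_i}{\gamma_{i,k}}$; $\gamma_{i,k+1} > \gamma_{i,k}$
is true by model assumption and $f' > 0$. Thus we have, with
$\epsilon > 0$:

$$M_i(f') = M_i(f) + \epsilon \implies M_i(f') > M_i(f)$$

We have now proved (9) for every possible value of $f$.
Thus:

$$\forall f_1, f_2 \in \mathbb{R}^+:\ 0 < f_1 < f_2 \implies M_i(f_1) \ge M_i(f_2)\ \ \forall \tau_i \in \tau \implies \sum_{i=1}^{n} M_i(f_1) \ge \sum_{i=1}^{n} M_i(f_2) \iff M_\tau(f_1) \ge M_\tau(f_2)$$

■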

This theorem directly implies the following property. 
Property 2. The feasibility of the system is sustainable regarding
the frequency.¹

Proof: By (8), $\tau$ is feasible on $m$ processors at frequency
$f > 0$ if and only if $m \ge M_\tau(f)$.

If $\tau$ is feasible on $m$ processors at frequency $f > 0$, then $\tau$
is feasible on $m$ processors at any greater frequency $f' \ge f$
because, by Theorem 3, the following holds:

$$f \le f' \text{ and } m \ge M_\tau(f) \implies m \ge M_\tau(f) \ge M_\tau(f') \implies m \ge M_\tau(f'),$$

which corresponds to the feasibility criterion at frequency $f'$.
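
Sustainability can be observed on the running two-task example by sampling the system-wide processor requirement over a grid of frequencies. The sketch below is only a spot check under assumed inputs: the grid step of 1/20, the task encoding, and the function name are arbitrary choices for illustration.

```python
from bisect import bisect_left
from fractions import Fraction


def M_tau(tasks, f):
    """Sum over tasks of k_i(f) plus the fractional processor share."""
    total = Fraction(0)
    for u, gamma in tasks:
        k = bisect_left(gamma, u / f)       # k_i(f)
        if k < len(gamma):
            g_lo = gamma[k - 1] if k > 0 else Fraction(0)
            total += k + (u - g_lo * f) / ((gamma[k] - g_lo) * f)
        else:
            total += k                      # sentinel gamma_{i,m+1} = infinity
    return total


F = Fraction
tasks = [(F(3, 2), [F(1), F(3, 2), F(2)]),
         (F(3, 4), [F(1), F(6, 5), F(13, 10)])]
freqs = [F(i, 20) for i in range(1, 61)]     # 0.05, 0.10, ..., 3.00
values = [M_tau(tasks, f) for f in freqs]
non_increasing = all(a >= b for a, b in zip(values, values[1:]))
first_feasible = next(f for f, v in zip(freqs, values) if v <= 3)
```

Once the sampled requirement drops to at most $m = 3$, it stays there for every larger sampled frequency, which is what the property states.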



C. Minimum optimal frequency 

Property 2 implies that there is a minimum frequency for
the system to be feasible. It would therefore be interesting to have
an algorithm to compute it for a particular task system $\tau$ and
a maximum number of processors $m$. We will first derive a
constraint on the frequency from the feasibility criterion. After
that, we will use this constraint to design an algorithm that
computes the optimal minimum frequency in $O(n^2 \log_2^2 m)$
time.

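Definition 3 (Minimum frequency notation).

$$\Psi(\tau, m, \kappa) \stackrel{\text{def}}{=} \frac{\sum_{i=1}^{n} \frac{u_i}{\gamma_{i,\kappa_i+1} - \gamma_{i,\kappa_i}}}{m - \sum_{i=1}^{n} \left( \kappa_i - \frac{\gamma_{i,\kappa_i}}{\gamma_{i,\kappa_i+1} - \gamma_{i,\kappa_i}} \right)}$$

where $\kappa = (\kappa_1, \kappa_2, \ldots, \kappa_n)$.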

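Property 3. If $k(f) = (k_1(f), k_2(f), \ldots, k_n(f))$, then the
following holds:

$$m \ge M_\tau(f) \iff f \ge \Psi(\tau, m, k(f)) \tag{10}$$

Proof: Let us define a few more notations:

$$M'_\tau(f) \stackrel{\text{def}}{=} \sum_{i=1}^{n} \left( k_i(f) - \frac{\gamma_{i,k_i(f)}}{\gamma_{i,k_i(f)+1} - \gamma_{i,k_i(f)}} \right), \qquad M''_\tau(f) \stackrel{\text{def}}{=} \sum_{i=1}^{n} \frac{u_i}{\gamma_{i,k_i(f)+1} - \gamma_{i,k_i(f)}}$$

A few things to notice:

• we have $M''_\tau(f) > 0$ because $u_i > 0$ (tasks aren't
trivial) and $\gamma_{i,k} < \gamma_{i,k+1}\ \forall k \in \{0, 1, \ldots, m\}$ (sub-linear
speedup ratio).

• $M_\tau(f) = M'_\tau(f) + \frac{M''_\tau(f)}{f}$ (with $f > 0$).

• the last two items imply that $M_\tau(f) > M'_\tau(f)$.

We have:

$$m \ge M_\tau(f) > M'_\tau(f) \implies m > M'_\tau(f) \iff m - M'_\tau(f) > 0 \tag{11}$$

And:

$$m \ge M_\tau(f) \iff m \ge M'_\tau(f) + \frac{M''_\tau(f)}{f} \iff \underbrace{m - M'_\tau(f)}_{> 0 \text{ by (11)}} \ge \frac{\overbrace{M''_\tau(f)}^{> 0}}{f} \iff f \ge \frac{M''_\tau(f)}{m - M'_\tau(f)} \iff f \ge \Psi(\tau, m, k(f))$$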

Let us denote by $f_{MIN}$ the optimal minimum frequency
such that the system $\tau$ is feasible on $m$ processors. By
Property 3, $f_{MIN}$ is the smallest positive real number $f$ such
that

$$f \ge \Psi(\tau, m, k(f)). \tag{12}$$



¹ i.e., increasing the frequency preserves the system schedulability.



Consider fixing each $k_i(f)$ term such that they are equal to
$k_i(f_{MIN})$. From there, it would be easy to calculate $f_{MIN}$
with the function $\Psi$ (in $O(n)$ time).

The first thing the algorithm will do is then to search for
those values (denoted by $\bar{\kappa}_1, \bar{\kappa}_2, \ldots, \bar{\kappa}_n$ such that
$\bar{\kappa}_i = k_i(f_{MIN})\ \forall \tau_i \in \tau$) and then compute the value of $f_{MIN}$ with
the expression $\Psi(\tau, m, \bar{\kappa} = (\bar{\kappa}_1, \bar{\kappa}_2, \ldots, \bar{\kappa}_n))$. The algorithm
will be presented in the next section.
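
Once the vector of permanently assigned processors is fixed, evaluating $\Psi$ is a single pass over the tasks. The sketch below writes $\Psi$ as the ratio that appears at the end of the proof of Property 3 (the frequency-independent numerator $M''$ over $m - M'$); the function name and the task encoding (utilization, speedup list without sentinels) are assumptions of this example.

```python
from fractions import Fraction


def psi(tasks, m, kappa):
    """Minimum-frequency expression for a fixed vector kappa.

    tasks is a list of (u_i, [gamma_{i,1}, ..., gamma_{i,m}]);
    each kappa[i] must lie in {0, ..., m - 1}.
    """
    numerator = Fraction(0)  # sum of u_i / (gamma_{i,k+1} - gamma_{i,k})
    offset = Fraction(0)     # sum of k - gamma_{i,k} / (gamma_{i,k+1} - gamma_{i,k})
    for (u, gamma), k in zip(tasks, kappa):
        g_lo = gamma[k - 1] if k > 0 else Fraction(0)  # sentinel gamma_{i,0} = 0
        step = gamma[k] - g_lo
        numerator += u / step
        offset += k - g_lo / step
    return numerator / (m - offset)


F = Fraction
tasks = [(F(3, 2), [F(1), F(3, 2), F(2)]),       # tau_1 = (6, 4, Gamma_1)
         (F(3, 4), [F(1), F(6, 5), F(13, 10)])]  # tau_2 = (3, 4, Gamma_2)
f_min = psi(tasks, 3, (2, 0))
```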

D. Algorithm Description 

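Algorithm 1: feasible($\tau$, $m$, $f$)
    sum ← 0
    for $\tau_i \in \tau$ do
        $\kappa_i$ ← $k_i(f)$
        sum ← sum + $\kappa_i + \frac{u_i - \gamma_{i,\kappa_i} \times f}{(\gamma_{i,\kappa_i+1} - \gamma_{i,\kappa_i}) \times f}$
    return $m \ge$ sum

Algorithm 2: minimumOptimalFrequency($\tau$, $m$)
    for $i \in \{1, 2, \ldots, n\}$ do
        if feasible($\tau$, $m$, $\frac{u_i}{\gamma_{i,m}}$) then
            $\bar{\kappa}_i$ ← $m - 1$
        else
            $\bar{\kappa}_i$ ← $\min_{\kappa=0}^{m-1} \{\kappa \mid \text{not feasible}(\tau, m, \frac{u_i}{\gamma_{i,\kappa+1}})\}$
    $\bar{\kappa} \stackrel{\text{def}}{=} (\bar{\kappa}_1, \bar{\kappa}_2, \ldots, \bar{\kappa}_n)$
    $f_{MIN}$ ← $\Psi(\tau, m, \bar{\kappa})$
    return $f_{MIN}$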
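
The two algorithms can be sketched in Python as follows. This is an illustrative transcription under stated assumptions, not the authors' implementation: tasks are encoded as (utilization, speedup list without sentinels), the inner minimum is found by a linear scan rather than a binary search, and the feasibility test additionally rejects any frequency at which some task would need more than $m$ cores. The task values are those of the example in Section III-D.

```python
from bisect import bisect_left
from fractions import Fraction


def feasible(tasks, m, f):
    """Feasibility test: compare m with the sum of the per-task terms."""
    total = Fraction(0)
    for u, gamma in tasks:
        k = bisect_left(gamma, u / f)  # k_i(f) by binary search
        if k >= m:
            # Assumption of this sketch: a task that needs more than m
            # cores makes the system infeasible outright.
            return False
        g_lo = gamma[k - 1] if k > 0 else Fraction(0)  # gamma_{i,0} = 0
        total += k + (u - g_lo * f) / ((gamma[k] - g_lo) * f)
    return m >= total


def psi(tasks, m, kappa):
    """Minimum frequency for a fixed vector of permanent processors."""
    numerator, offset = Fraction(0), Fraction(0)
    for (u, gamma), k in zip(tasks, kappa):
        g_lo = gamma[k - 1] if k > 0 else Fraction(0)
        step = gamma[k] - g_lo
        numerator += u / step
        offset += k - g_lo / step
    return numerator / (m - offset)


def minimum_optimal_frequency(tasks, m):
    """Main search, with a linear scan in place of the binary search."""
    kappa = []
    for u, gamma in tasks:
        if feasible(tasks, m, u / gamma[m - 1]):      # test f = u_i / gamma_{i,m}
            kappa.append(m - 1)
        else:
            kappa.append(min(k for k in range(m)
                             if not feasible(tasks, m, u / gamma[k])))
    return psi(tasks, m, kappa), kappa


F = Fraction
tasks = [(F(3, 2), [F(1), F(3, 2), F(2)]),       # tau_1 = (6, 4, Gamma_1)
         (F(3, 4), [F(1), F(6, 5), F(13, 10)])]  # tau_2 = (3, 4, Gamma_2)
f_min, kappa = minimum_optimal_frequency(tasks, 3)
```

Exact fractions are used because the feasibility test compares a sum against $m$ at the boundary frequency itself, where floating-point rounding could flip the result.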



We have designed an algorithm to determine the optimal
minimum frequency (see Algorithm 2). The algorithm essentially
searches systematically for the minimum frequency
that satisfies the constraints of (6) by calling the feasibility test
function (Algorithm 1). For each value $\kappa$ that we want to test,
we determine from (5) the minimum frequency $f$ such that $\tau_i$
requires $\kappa + 1$ processors (i.e., $f = k_i^{-1}(\kappa).f_1 = \frac{u_i}{\gamma_{i,\kappa+1}}$). The
value of $f$ can be determined in $O(1)$ time.

In the feasibility test, we determine the value of $k_i(f)$
from frequency $f$ according to (4), which can be obtained
in $O(\log_2 m)$ time by binary search over $m$ values. Thus, to
calculate $k_i(f)$ for all $\tau_i \in \tau$ and sum all the $M_i(f)$ terms, the
total time complexity of the feasibility test is $O(n \log_2 m)$.

In the main algorithm aimed at calculating $f_{MIN}$, the value
of $\bar{\kappa}_i$ can also be found by binary search and thus requires
$O(\log_2 m)$ search steps. This is made possible by the
sustainability of the system regarding the frequency (proved
by Property 2). Indeed, if $\tau$ is feasible on $m$ processors
with $\bar{\kappa}_i$ ($f = \frac{u_i}{\gamma_{i,\bar{\kappa}_i+1}}$), then it is also feasible with $\bar{\kappa}_i - 1$
($f = \frac{u_i}{\gamma_{i,\bar{\kappa}_i}} > \frac{u_i}{\gamma_{i,\bar{\kappa}_i+1}}$).

In order to calculate the complete vector $\bar{\kappa}$, there will
be $O(n \log_2 m)$ calls to the feasibility test. Since computing
$\Psi$ is linear-time when the vector $\bar{\kappa}$ is already stored in
memory, the total time complexity to determine the optimal
feasible frequency is $O(n^2 \log_2^2 m)$. In order to determine the
optimal combination of frequency and number of processors,
we simply iterate over all possible numbers of active processors
$\ell = 1, 2, \ldots, m$, executing Algorithm 2 with inputs $\tau$ and $\ell$.
We return the combination that results in the minimum overall
power-dissipation rate. Thus, the overall complexity to find the
optimal combination is $O(m\, n^2 \log_2^2 m)$.
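
The outer iteration over the number of active processors can be sketched as below. Both callables are hypothetical interfaces introduced for this example: `min_frequency(cores)` stands in for a run of Algorithm 2 on that many processors, and `power(f, k)` stands in for the platform's $P(f, k)$. The frequency table and the two power models are illustrative placeholders, not measurements.

```python
def select_configuration(m, min_frequency, power):
    """Try every number of active cores and keep the cheapest pair.

    Returns (power rate, number of cores, frequency) for the best choice.
    """
    best = None
    for cores in range(1, m + 1):
        f = min_frequency(cores)   # lowest feasible frequency on `cores` cores
        rate = power(f, cores)     # power-dissipation rate of that setting
        if best is None or rate < best[0]:
            best = (rate, cores, f)
    return best


# Illustrative inputs only: placeholder frequencies and power models.
freq_table = {1: 2.25, 2: 1.25, 3: 0.9375}


def dynamic_only(f, k):
    """Hypothetical power model with a purely dynamic cubic term per core."""
    return k * f ** 3


def with_static(f, k):
    """Hypothetical power model adding a fixed static cost per active core."""
    return k * (f ** 3 + 2.0)


best_dynamic = select_configuration(3, freq_table.get, dynamic_only)
best_static = select_configuration(3, freq_table.get, with_static)
```

With these placeholder models, a per-core static cost can make fewer cores at a higher frequency the cheaper choice, which is why the search compares every core count rather than always using all $m$ cores.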

E. An Example 

Let us use the same example system as previously introduced
in Section III-D. Consider $\tau = \{\tau_1, \tau_2\}$ to be scheduled
on $m = 3$ identical processors. Tasks are defined as follows:
$\tau_1 = (6, 4, \Gamma_1)$ with $\Gamma_1 = (1.0, 1.5, 2.0)$ and $\tau_2 = (3, 4, \Gamma_2)$
with $\Gamma_2 = (1.0, 1.2, 1.3)$. The vector $\bar{\kappa}$ corresponding to
this configuration computed by the algorithm is equal to
$(\bar{\kappa}_1 = 2, \bar{\kappa}_2 = 0)$. This implies that the optimal minimum
frequency for this system to be feasible on 3 processors is
equal to $f_{MIN} = \Psi(\tau, 3, (2, 0)) = 0.9375$. We can see that if
we call the feasibility test function for any frequency greater
than or equal to 0.9375, it will return True; it will return False
for any lower value.

F. Proof of Correctness 

The efficiency and correctness of the above algorithm depend
upon the theorems presented in Sections V-B and V-C.
Furthermore, the algorithm is correct if the value $\Psi$ computed
using the previously calculated vector $\bar{\kappa}$ is equal to the
minimum optimal frequency as defined by (12). That will be
the goal of our last theorem.
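Theorem 4.

$$f_{MIN} = \Psi(\tau, m, \bar{\kappa}) \tag{13}$$

Proof: We will need an auxiliary notion:

$$M_i(f, \kappa) \stackrel{\text{def}}{=} \kappa + \frac{u_i - \gamma_{i,\kappa} \times f}{(\gamma_{i,\kappa+1} - \gamma_{i,\kappa}) \times f}, \qquad \kappa \in \{0, 1, \ldots, m - 1\}$$

Notice the following:

$$M_i(f) = M_i(f, k_i(f)) \quad \forall \tau_i \in \tau$$

By definition,

$$f_{MIN} = \min\{f \in \mathbb{R}^+ \mid m \ge M_\tau(f)\} = \min\{f \in \mathbb{R}^+ \mid f \ge \Psi(\tau, m, k(f))\} \text{ (by Property 3)} = \Psi(\tau, m, k(f_{MIN})).$$

This is equivalent to

$$m = M_\tau(f_{MIN}) \quad \text{by Property 3.}$$

We will prove the following:

$$\forall\, 1 \le i \le n:\quad M_i(f_{MIN}, \bar{\kappa}_i) = M_i(f_{MIN}, k_i(f_{MIN})) = M_i(f_{MIN})$$

This would imply:

$$\sum_{i=1}^{n} M_i(f_{MIN}, \bar{\kappa}_i) = M_\tau(f_{MIN}) = m \iff \Psi(\tau, m, \bar{\kappa}) = \Psi(\tau, m, k(f_{MIN})) = f_{MIN}$$

Notice that when $k_i(f_{MIN}) = \bar{\kappa}_i$, then we have
$M_i(f_{MIN}) = M_i(f_{MIN}, \bar{\kappa}_i)$.

We will have four cases of possible values of $\bar{\kappa}_i$ to investigate:

• $\bar{\kappa}_i = m - 1$, the basic case of the algorithm,

• $\bar{\kappa}_i = 0$,

• $\bar{\kappa}_i > 0$, when $f_{MIN} \ne \frac{u_i}{\gamma_{i,\bar{\kappa}_i}}$,

• $\bar{\kappa}_i > 0$, when $f_{MIN} = \frac{u_i}{\gamma_{i,\bar{\kappa}_i}}$.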
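Basic case, $\bar{\kappa}_i = m - 1$:

$$\text{feasible}\!\left(\tau, m, \frac{u_i}{\gamma_{i,m}}\right) \implies f_{MIN} \le \frac{u_i}{\gamma_{i,m}} < \frac{u_i}{\gamma_{i,m-1}} < \cdots < \frac{u_i}{\gamma_{i,1}} \implies k_i(f_{MIN}) \ge k_i\!\left(\frac{u_i}{\gamma_{i,m}}\right) = m - 1,$$

but for the system to be feasible, we must have $k_i(f_{MIN}) < m$, so:

$$m - 1 \le k_i(f_{MIN}) < m \iff k_i(f_{MIN}) = m - 1$$

Complex case, $\bar{\kappa}_i = 0$:

$$\neg\,\text{feasible}\!\left(\tau, m, \frac{u_i}{\gamma_{i,1}}\right) \implies f_{MIN} > \frac{u_i}{\gamma_{i,1}} > \frac{u_i}{\gamma_{i,2}} > \cdots > \frac{u_i}{\gamma_{i,m}} \implies \frac{u_i}{\gamma_{i,1}} < f_{MIN} < \infty \implies k_i(f_{MIN}) = 0$$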

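Complex case, $\bar{\kappa}_i > 0$:

$$\neg\,\text{feasible}\!\left(\tau, m, \frac{u_i}{\gamma_{i,\bar{\kappa}_i+1}}\right) \wedge \text{feasible}\!\left(\tau, m, \frac{u_i}{\gamma_{i,\bar{\kappa}_i}}\right) \implies \frac{u_i}{\gamma_{i,\bar{\kappa}_i+1}} < f_{MIN} \le \frac{u_i}{\gamma_{i,\bar{\kappa}_i}}$$

Case $f_{MIN} \ne \frac{u_i}{\gamma_{i,\bar{\kappa}_i}}$:

$$\frac{u_i}{\gamma_{i,\bar{\kappa}_i+1}} < f_{MIN} < \frac{u_i}{\gamma_{i,\bar{\kappa}_i}} \implies f_{MIN} \in k_i^{-1}(\bar{\kappa}_i) \implies k_i(f_{MIN}) = \bar{\kappa}_i$$

Case $f_{MIN} = \frac{u_i}{\gamma_{i,\bar{\kappa}_i}}$:

$$k_i(f_{MIN}) = \bar{\kappa}_i - 1$$

$$M_i(f_{MIN}) = M_i(f_{MIN}, k_i(f_{MIN})) = M_i\!\left(\frac{u_i}{\gamma_{i,\bar{\kappa}_i}}, \bar{\kappa}_i - 1\right) = \bar{\kappa}_i - 1 + \frac{u_i - \gamma_{i,\bar{\kappa}_i-1} \times \frac{u_i}{\gamma_{i,\bar{\kappa}_i}}}{(\gamma_{i,\bar{\kappa}_i} - \gamma_{i,\bar{\kappa}_i-1}) \times \frac{u_i}{\gamma_{i,\bar{\kappa}_i}}} = \bar{\kappa}_i - 1 + 1 = \bar{\kappa}_i$$

$$M_i(f_{MIN}, \bar{\kappa}_i) = \bar{\kappa}_i + \frac{u_i - \gamma_{i,\bar{\kappa}_i} \times \frac{u_i}{\gamma_{i,\bar{\kappa}_i}}}{(\gamma_{i,\bar{\kappa}_i+1} - \gamma_{i,\bar{\kappa}_i}) \times \frac{u_i}{\gamma_{i,\bar{\kappa}_i}}} = \bar{\kappa}_i + 0 = \bar{\kappa}_i$$

Hence $M_i(f_{MIN}, \bar{\kappa}_i) = M_i(f_{MIN})$ in this case as well. ■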



VI. Experimental Evaluation & Simulation 

In order to obtain realistic predictions regarding the effect of 
parallelism upon power consumption, we have evaluated our 
algorithm upon an actual hardware testbed. In this section, we 
describe and discuss the high-level overview of the methodol- 



ogy employed in our evaluation, the low-level details involved 
in our evaluation methodology, and the results obtained from 
our experiments. 

A. Methodology Overview 

Realistic predictions of the energy behavior of a real-time 
parallel system using our frequency-selection algorithm re- 
quires a hard-real-time parallel application to execute upon an 
instrumented multicore hardware testbed. In the Compositional 
and Parallel Real-Time Systems (CoPaRTS) laboratory at 
Wayne State University, we have developed a power/thermal- 
aware testbed infrastructure to obtain accurate power and 
temperature readings. Thus, we may obtain realistic hardware 
power measurements for any application executing on our 
testbed. 

Regarding the hard-real-time parallel application, we are 
unfortunately not aware of any such available application that 
matches the malleable job model used in this paper.² However,
given the continuous march of the real-time and embedded 
computing domains towards increasingly parallel architectures, 
we fully expect that such applications will be developed in 
the near future. Thus, it behooves us to obtain as close to 
realistic as possible parameters for such future parallel real- 
time applications. We have developed a methodology with this 
goal in mind. Below is a high-level overview of the steps of 
our design methodology. The details for each step are in the 
next subsection. 

1) Modify Testbed: We have modified a multicore platform 
to obtain accurate instantaneous CPU power readings. 
Furthermore, our hardware testbed has the ability to run 
at a discrete set of frequencies and turn off individual 
cores. Thus, our platform can approximately implement 
the frequencies determined from the frequency/processor- 
selection algorithm (Section V).

2) Obtain Realistic Speedup Vectors: Since we do not 
possess a hard-real-time application with malleable parallel
jobs, we have observed the execution behavior of
two different non-real-time parallel benchmarks (an I/O- 
constrained and non-I/O-constrained application) over 
different processing frequencies and levels of parallelism. 
Our observations are used to construct two realistic 
speedup vectors to use in our simulation (Step 4).

3) Obtain Realistic Power Rates: Using the same non- 
real-time parallel benchmarks, we also construct a matrix 
of power dissipation rates over a range of processing 
frequencies and number of active cores. Again, our mea- 
surements are utilized in the next simulation step. 

4) Power-Savings Simulation: After obtaining the speedup 
vectors and corresponding power dissipation rates, we 
evaluate our algorithm over randomly generated task 
systems. Our frequency/processor-selection algorithm is 
compared against the power required by an optimal 
non-parallel real-time scheduling approach (e.g., Pfair [15]). 

2 In fact, we are also unaware of any commercially or freely-available 
application for any of the other hard-real-time parallel job models. 
B. Methodology Details 

1) Testbed: For our testbed platform, we use an Intel i7 950 
processor with eight cores (four physical cores, each 
having two "soft" cores, i.e., hyperthreads). 
The processor supports 13 different frequency settings. (The 
frequency level is set for the whole processor, and all cores 
execute at this global frequency.) We use a Linux 2.6.27 kernel with 
the PREEMPT-RT patch as our operating system. In addition, we 
have developed kernel modules for individual core shutdown 
and for frequency modulation. 

The testbed requires a few hardware modifications to measure 
the actual CPU power usage. Towards this goal, we 
connect four shunt resistors (0.05 Ω each) in series with the 
four-wire eATX power connector interface of the motherboard 
(each 12V power line is shunted with a 0.05 Ω resistor). 
We measure the current (A) drawn by the CPU using a National 
Instruments NI 9205 data acquisition unit. Then, we 
calculate the total instantaneous CPU power as the sum of 
the individual powers delivered through each eATX +12V 
motherboard line. We run the testbed under every combination 
of the 13 supported frequencies and the number of active cores, 
and record the corresponding power-dissipation rates for the 
system. When the number of active cores is less than eight, 
there is a choice of which cores to shut down. To address this 
choice, we consider all the possible shutdown scenarios for 
a given number of active cores and use the average of the 
power rates over all the scenarios. For example, in our eight-core 
processor, we have seven different ways to shut down a single 
core.³ We calculate the power consumption of the system for 
each individual case, and the average power is recorded as our 
final power-rate measurement for the combination of the given 
frequency and number of active cores. 
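
The power calculation and scenario averaging described above can be sketched as follows. This is only an illustration, not the acquisition code used on the testbed: it assumes the data acquisition unit reports the voltage drop across each shunt, that the current follows from Ohm's law, and that each line sits at a nominal 12 V. The function names and constants other than the 0.05 Ω resistance are hypothetical.

```python
SHUNT_OHMS = 0.05   # shunt resistance per 12V line (value given in the text)
RAIL_VOLTS = 12.0   # nominal line voltage (assumption: no voltage droop)


def cpu_power(shunt_drops_volts, shunt_ohms=SHUNT_OHMS, rail_volts=RAIL_VOLTS):
    """Total instantaneous power (W), summed over all shunted 12V lines."""
    total = 0.0
    for drop in shunt_drops_volts:
        current = drop / shunt_ohms     # Ohm's law: I = V_shunt / R
        total += rail_volts * current   # per-line power; shunt loss is ignored
    return total


def average_power_rate(scenario_powers_watts):
    """Average power over all shutdown scenarios for one
    (frequency, number-of-active-cores) setting."""
    return sum(scenario_powers_watts) / len(scenario_powers_watts)
```

For instance, a 0.05 V drop across each of the four shunts corresponds to 1 A per line under these assumptions.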

2) Speedup Vectors and Power Functions: From our 
testbed, we can generate both a speedup vector and a 
power-dissipation-rate function for non-I/O-constrained (i.e., 
CPU-bound) and I/O-constrained (i.e., memory-bound) parallel 
applications. In order to obtain these parameters, we use two 
parallel applications: a modified version of Jetbench [16] as 
a non-I/O-constrained application and a modified parallel 
version of the GNU Compiler Collection (GCC) [17] as 
an I/O-constrained application. Jetbench is an open-source 
OpenMP-based multicore benchmark application that emulates 
jet engine performance from real jet engine parameters and 
thermodynamic equations presented in NASA's EngineSim 
program. For GCC, using the "-j" option for GNU Make [18], 
we concurrently compile a collection of source code files under 
a variable number of active processor cores. 

To obtain the speedup vectors for both Jetbench and GCC, 
we execute the applications upon different numbers of active 
cores, recording for each number of cores the response time 
for the application. The speedup for x cores is 
determined by the ratio of the response time on one core 
to the response time of the application running concurrently 
on x cores. Figure 1 plots the speedup vectors for the two 

3 We cannot shut down the boot core. 
Fig. 1. Speedup Vectors for Jetbench and GCC 




Fig. 2. Power Function Over Varying Number of Active Cores and 
Frequencies for Non-I/O-Constrained Parallel Application (Jetbench) 

applications. Not surprisingly, Jetbench benefits more 
from an increasing number of processors due to its CPU-bound 
nature and inherently parallelizable workload. 
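
As a small illustration of the speedup definition above (a sketch, not the scripts used to produce Figure 1), the speedup vector can be computed directly from the recorded response times. The function name and the list layout, in which entry k holds the response time on k+1 cores, are assumptions made for this example.

```python
def speedup_vector(response_times):
    """Return the speedup for 1, 2, ..., n active cores.

    response_times[k] is the measured response time on k + 1 cores;
    the speedup on x cores is (time on one core) / (time on x cores).
    """
    t_one_core = response_times[0]
    return [t_one_core / t for t in response_times]
```

By construction the first entry is always 1, and a perfectly parallel application would yield the entries 1, 2, 3, and so on.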

To determine the power-dissipation rates for both Jetbench 
and GCC, we execute these applications for all combinations 
of frequency and number of active cores and record 
both the power-dissipation rate and the speedup values for the 
application. The power-dissipation rates are determined using 
the measurement hardware described above in Section VI-B1. 
Each recorded value is an average of the power measured at 
1 ms sampling intervals for the duration of the application. 
Figure 2 plots the power-dissipation function for Jetbench; 
Figure 3 plots the power-dissipation function for GCC. Observe 
that the power-dissipation levels for Jetbench are slightly higher 
in most cases than the levels for GCC; this is likely due to 
the fact that GCC idles the processor more often during I/O 
operations. 
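
The construction of the power-dissipation function from the sampled measurements can be sketched as below. This is a hedged illustration only: the dictionary layout, the function name, and any frequency or power values passed to it are hypothetical, and the sketch simply averages the samples of each run as described above.

```python
def build_power_table(traces):
    """Average the power samples recorded for each setting.

    traces maps (frequency, active_cores) to the list of power
    samples (W) taken at a fixed interval over one application run.
    Returns a dict mapping the same keys to the mean power (W).
    """
    table = {}
    for setting, samples in traces.items():
        if not samples:
            raise ValueError(f"no samples recorded for {setting}")
        table[setting] = sum(samples) / len(samples)
    return table
```

A lookup such as `table[(frequency, active_cores)]` then plays the role of the functions plotted in Figures 2 and 3.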

3) Power-Savings Simulation: We randomly generate task 
systems using a variant of the UUniFast-Discard algorithm 
by Davis and Burns [19]. In the UUniFast-Discard 
algorithm, the user supplies a desired system-level utilization 
and number of tasks, and the algorithm returns a task 
system in which each task utilization is randomly 
generated from a uniform distribution. 
The difference between UUniFast-Discard and the original 
UUniFast algorithm of Bini and Buttazzo [20] is that 
Fig. 4. Average power savings for non-I/O-constrained workload (Jetbench) 
when $U_{\max} = 0.4$ 



Fig. 3. Power Function Over Varying Number of Active Cores and 
Frequencies for I/O-Constrained Parallel Application (GCC) 



UUniFast-Discard generates task systems with system 
utilizations exceeding one, but task utilizations at most one. 
These restrictions make UUniFast-Discard appropriate 
for multiprocessor scheduling settings with non-parallel 
real-time jobs. To extend UUniFast-Discard, we modified 
the algorithm to permit task utilizations to exceed one (i.e., 
a job is required to execute on more than one processor 
to complete by its deadline) and to fix a single task at a 
given maximum utilization. We call our extended algorithm 
UUniFast-Discard-Max. The utilization for each task 
generated by UUniFast-Discard-Max (except for the task 
with fixed maximum utilization) is drawn from a uniform 
distribution. 
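
A minimal sketch of such a generator is given below. It is not the authors' implementation: the text does not state the exact discard rule, so the sketch assumes that a draw is discarded whenever any task other than the fixed one exceeds the maximum utilization. The function names, the `max_tries` bound, and the error handling are likewise assumptions made for this example.

```python
import random


def uunifast(n, u_total, rng):
    """UUniFast (Bini and Buttazzo): n utilizations summing to u_total."""
    utils = []
    remaining = u_total
    for i in range(1, n):
        next_remaining = remaining * rng.random() ** (1.0 / (n - i))
        utils.append(remaining - next_remaining)
        remaining = next_remaining
    utils.append(remaining)
    return utils


def uunifast_discard_max(n, u_total, u_max, rng=None, max_tries=10000):
    """Fix the first task at u_max and draw the other n - 1 with UUniFast.

    Assumed discard rule (not stated in the text): reject any draw in
    which another task's utilization exceeds u_max.
    """
    rng = rng or random.Random()
    rest = u_total - u_max
    if n < 2 or rest < 0 or rest > (n - 1) * u_max:
        raise ValueError("no task system satisfies these parameters")
    for _ in range(max_tries):
        others = uunifast(n - 1, rest, rng)
        if max(others) <= u_max:
            return [u_max] + others
    raise RuntimeError("gave up after max_tries discarded draws")
```

Under this assumed rule, some parameter combinations admit no valid task system, so the function raises an error rather than looping indefinitely; the generator used for the experiments may treat such cases differently.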

Using the random-task generator, we generate task systems 
with a total of eight tasks. The total system utilization is varied 
from 1.5 to 8, and the UUniFast-Discard-Max algorithm 
assigns a maximum utilization to the first task. In our simulations, 
we set the maximum utilization value $U_{\max}$ (i.e., $\max_{i=1}^{n} U_i$) 
to 0.4, 0.8, and 1.2. The simulation runs over all 
total utilization values from 1.5 to 8 in increments of 0.1 and, 
to match our testbed settings, the number of 
available cores is varied from one through eight. We run a 
variant of Algorithm 2 that iterates through all combinations of 
frequency and number of active cores, instead of using a 
binary search. (Our power function does not exactly satisfy the 
non-decreasing property required for binary search to work.) 
At each utilization point, we store the exact frequency returned 
by our algorithm. For comparison, we determine the minimum 
frequency required for an optimal (non-parallel) scheduling 
algorithm to schedule the same task system. This value can be 
obtained by solving the following for $f$: for any task system $\tau$, 
$U(\tau) \le m f$ (i.e., $f \ge U(\tau)/m$). Using these resulting frequencies, we obtain the 
optimal minimum frequency for the non-parallel and parallel 
settings. We then use these frequencies to look up the 
power-dissipation rates for the respective application by using the 
functions displayed in Figures 2 and 3. In the next subsection, 
we plot the power savings; i.e., we plot the power-dissipation 
level obtained from our algorithm minus the power-dissipation 




Fig. 5. Average power savings for non-I/O-constrained workload (Jetbench) 
when $U_{\max} = 0.8$ 

level required for the optimal non-parallel algorithm. Each data 
point is the average power saving for 1000 different randomly- 
generated task systems. 
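
The two frequency computations used in the simulation can be sketched as follows. This is an illustration under stated assumptions, not the simulator itself: frequencies are taken to be normalized so that the condition $U(\tau) \le m f$ applies directly, `is_schedulable` is a hypothetical stand-in for the feasibility test of Algorithm 2 (which is not reproduced here), and the power table is assumed to map (frequency, active cores) to watts.

```python
def min_nonparallel_frequency(total_util, m, frequencies):
    """Smallest available frequency f with total_util <= m * f,
    or None if even the highest frequency is insufficient."""
    for f in sorted(frequencies):
        if total_util <= m * f:
            return f
    return None


def exhaustive_min_power(power_table, is_schedulable):
    """Scan every (frequency, cores) setting instead of binary searching.

    Returns the schedulable setting with the lowest power as a tuple
    (frequency, cores, watts), or None if no setting is schedulable.
    """
    best = None
    for (f, cores), watts in power_table.items():
        if is_schedulable(f, cores) and (best is None or watts < best[2]):
            best = (f, cores, watts)
    return best
```

Scanning all settings costs one feasibility test per table entry, but it remains correct even when the measured power function is not monotone, which is the reason given above for avoiding binary search.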

C. Results & Discussion 

Figures 4, 5, and 6 display the power savings obtained from 
simulating over the parallel/power parameters obtained from 
the Jetbench application. Figures 7, 8, and 9 display the power 
savings for the GCC application. The largest power savings is 
60 watts (for GCC when $U_{\max} = 0.4$ and both the utilization 
and the number of active cores equal eight), which is significant since, from 
Figures 2 and 3, the maximum power-dissipation rate is around 
80 watts. 

From these plots, there are a few noticeable trends: 1) As 
$U_{\max}$ increases, the power savings decrease for both applications; 
the reason for this decrease is that larger-utilization jobs 
require greater parallelization and thus more parallel overhead, 
which reduces the power savings. 2) As the total utilization 
increases, the power savings increase (for more than two active 
processors); in this case, the savings appear to be due to 
the fact that the power-dissipation rates are considerably higher 
at the highest core frequencies. Thus, if our parallel algorithm 
can reduce the frequency relative to the non-parallel algorithm by 
even a slight amount, there is a significant power savings. 3) The 
power savings for both applications are similar; however, the 
I/O-constrained application, GCC, appears to have slightly 
higher power savings. Again, the power-dissipation function 
for GCC may reward small frequency reductions slightly more 
than Jetbench's function. Also, since we have a discrete set 
of frequencies, many of the different frequencies returned by 
Algorithm 2 will get mapped to the same core frequency, 
reducing the differences between the two applications. 

Fig. 6. Average power savings for non-I/O-constrained workload (Jetbench) 
when $U_{\max} = 1.2$ 

Fig. 7. Average power savings for I/O-constrained workload (GCC) when 
$U_{\max} = 0.4$ 

Fig. 8. Average power savings for I/O-constrained workload (GCC) when 
$U_{\max} = 0.8$ 

Fig. 9. Average power savings for I/O-constrained workload (GCC) when 
$U_{\max} = 1.2$ 

VII. Conclusions 

In this paper, we explore the potential energy savings that 
could be obtained from exploiting parallelism present in a 
real-time application. We consider the case of malleable Gang 
scheduled parallel jobs and design an optimal polynomial-time 
algorithm for determining the frequency to run each active core 
when we have the constraint of homogeneous core frequencies. 
Simulations with power data from an actual hardware testbed 
confirm the efficacy of our approach by providing significant 
power savings over the optimal non-parallel scheduling 
approach. As real-time embedded systems are trending toward 
multicore architectures, our research suggests the potential for 
reducing the overall energy consumption of these devices by 
exploiting task-level parallelism. In the future, we will extend 
our research to investigate the power-saving potential when the 
cores may execute at different frequencies and also incorporate 
thermal constraints into the problem. 